This notebook and the contents of this directory were copied from https://github.com/genepattern/single_cell_clustering_notebook.git and adapted for running on ScienceData.
Authors - Clarence K. Mah, Alex T. Wenzel, and Edwin F. Juárez
Contact Email - ejuarez@ucsd.edu
This notebook is described in https://f1000research.com/articles/7-1306/v2
The goal of this notebook is to provide a standard single-cell RNA-seq analysis workflow for pre-processing, identifying sub-populations of cells by clustering, and exploring biomarkers to explain intra-population heterogeneity. The workflow is modeled after the Seurat Guided Clustering Tutorial[[1]](https://www.nature.com/articles/nbt.3192) and performs all analyses using the scanpy[[2]](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1382-0) library.
Single-cell RNA sequencing (scRNA-seq) has emerged as a popular method to profile gene expression at the resolution of individual cells. Compared to traditional RNA-seq collected from bulk cells or tissue, scRNA-seq enables users to capture cell-by-cell transcriptomic variability. This information can then be used to define and characterize heterogeneity within a population of cells, from identifying known cell types to discovering novel ones. Here we examine a peripheral blood mononuclear cells (PBMCs) dataset, which consists of many immune cell types including T-cells, B-cells, NK-cells, macrophages, and dendritic cells.
This notebook analyzes a dataset of 3K Peripheral Blood Mononuclear Cells (PBMCs) from a Healthy Donor available from 10X Genomics, sequenced on the Illumina NextSeq 500.
Filter cells based on QC metrics.
Perform log scaling and normalization to total counts.
Remove unwanted sources of variation (number of detected molecules per cell as well as the percentage mitochondrial gene content).
Detect highly variable genes.
Perform linear dimensional reduction (PCA).
.csv
files or a compressed .h5ad
format.First set up the needed environment. We disable ssl certificate check in case you're gonna work with your data on ScienceData over the private network, i.e. connect to https://sciencedata/files/.
#%load_ext autoreload
#%autoreload 2
import os
import subprocess
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
import requests
def download(url):
fileName = os.path.basename(url)
if fileName and not os.path.isfile(fileName):
subprocess.check_output(['wget',url])
def upload(file, folderUrl):
requests.put(file, folderUrl+'/'+file)
#scriptUrl = 'https://raw.githubusercontent.com/genepattern/single_cell_clustering_notebook/master/singlecell.py'
scriptUrl = 'https://sciencedata.dk/public/6e3ed434c0fa43df906ce2b6d1ba9fc6/single_cell_clustering_notebook/singlecell.py'
requirementsUrl = 'https://sciencedata.dk/public/6e3ed434c0fa43df906ce2b6d1ba9fc6/single_cell_clustering_notebook/requirements.txt'
download(scriptUrl)
download(requirementsUrl)
os.system("pip show louvain 2>/dev/null > /dev/null; if [ $? -ne 0 ]; then pip install setuptools==57.5.0; pip install -r requirements.txt; fi");
Load raw count data for a single-cell RNA-seq experiment.
This notebook accepts either a single matrix file that includes gene and cell identifiers (as the first column and row respectively) or the three-file output from the 10X Genomics single-cell pipeline.
The single matrix file can be in the following formats: csv
, txt
, tsv
, tab
, data
, h5
, h5ad
, loom
For data with 2,700 cells and 33,000 genes, this step will run in 10 to 20 seconds (depending on Internet speed)
Below you can either download data from 10x, unpacked pbmc3k_filtered_gene_bc_matrices.tar.gz
and uploaded filtered_gene_bc_matrices
to a non-public folder tmp
on ScienceData; then uncomment the relevant line. Or you can leave as is and use provided example data.
# Uncomment this to use public example data from ScienceData
data_folder = "https://sciencedata.dk/public/6e3ed434c0fa43df906ce2b6d1ba9fc6/single_cell_clustering_notebook/data/"
# Uncomment this to use data you've downloaded to ScienceData from 10x
#data_folder = "https://sciencedata/files/tmp/"
# Uncomment this to use example data from the notebook authors
#data_folder = "https://github.com/genepattern/single_cell_clustering_notebook/raw/master/data/"
matrixDataFile = ""
MTXDataFile = data_folder+"matrix.mtx"
geneNameFile = data_folder+"genes.tsv"
barCodesFile = data_folder+"barcodes.tsv"
# Workaround for bug in GPNB 0.7.2
from IPython.display import display, Javascript
display(Javascript("""
setTimeout(function() {
$(".gp-widget-call").each(function(i, e) {
const widget = $(e).data("widget");
widget.options.append_output = false;
})
}, 1000);
"""))
import nbtools
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
import sys
sys.path.append('.')
import genepattern
from singlecell import SingleCellAnalysis, sc
import logging
logging.captureWarnings(True)
sc.settings.verbosity = 0
logging.disable(logging.INFO)
%matplotlib inline
# Workaround for bug in GPNB 0.7.2
from IPython.display import display, Javascript
display(Javascript("""
setTimeout(function() {
$(".gp-widget-call").each(function(i, e) {
const widget = $(e).data("widget");
widget.options.append_output = false;
})
}, 1000);
"""))
analysis = SingleCellAnalysis()
genepattern.GPUIBuilder(
analysis.setup_analysis,
function_import='analysis.setup_analysis',
name='Setup Analysis',
parameters={
'csv_filepath': {
'name': 'matrix data file',
'description': 'Provide your data file either as a URL or local file path. See above instructions for supported formats.',
'default': matrixDataFile,
'type': 'file',
'kinds': ['csv', 'xlsx', 'txt', 'tsv', 'tab', 'data', 'h5', 'h5ad', 'soft.gz', 'txt.gz', 'anndata', 'mtx', 'zip'],
},
'gene_x_cell': {
'name': 'Matrix dimension names',
'description': 'If the rows of your matrix are genes, choose "gene by cell", otherwise choose "cell by gene"',
'default': True,
'type': 'choice',
'choices': {
"gene by cell": True,
"cell by gene": False
}
},
'mtx_filepath': {
'name': '10x .mtx data file',
'description': 'If you have data from 10X, upload or link your .mtx file here.',
'default': MTXDataFile,
'choices': {'2700 PBMCs from a Healthy Donor (example) - mtx': 'https://github.com/genepattern/single_cell_clustering_notebook/raw/master/data/matrix.mtx'},
'type': 'file',
'kinds': ['mtx', 'zip']
},
'gene_filepath': {
'name': '10x gene name file',
'description': 'If you have data from 10X, upload or link your genes.tsv file here',
'default': geneNameFile,
'choices': {'2700 PBMCs from a Healthy Donor (example) - genes': 'https://github.com/genepattern/single_cell_clustering_notebook/raw/master/data/genes.tsv'},
'type': 'file',
'kinds': ['tsv', 'zip']
},
'bc_filepath': {
'name': '10x barcodes file',
'description': 'If you have data from 10X, upload or link your barcodes.tsv file here',
'default': barCodesFile,
'choices': {'2700 PBMCs from a Healthy Donor (example) - barcodes': 'https://github.com/genepattern/single_cell_clustering_notebook/raw/master/data/barcodes.tsv'},
'type': 'file',
'kinds': ['tsv', 'zip']
},
'output_var': {
'hide': True
}
}
)
Perform cell quality control by filtering based on quality metrics, log normalizing counts, scaling by total counts, and correcting for effects of total counts per cell and the percentage of mitochondrial genes expressed. Finally, detect highly variable genes and perform linear dimensional reduction (PCA).
For 2,700 cells and 33,000 genes, this step will run in approximately 10 to 20 seconds
genepattern.GPUIBuilder(
analysis.preprocess_counts,
function_import='analysis.preprocess_counts',
name='Preprocess Counts',
parameters={
'data': {
'description': 'Output from the "Setup Analysis" tool.',
'default': 'analysis'
},
'min_n_cells': {
'name': 'filter genes (min. # of cells)',
'description': 'Include genes expressed in at least this many cells. Blank will be treated as 0.',
'type': 'number',
'default': 3
},
'min_n_genes': {
'name': 'filter cells (min. # of genes)',
'description': 'Include cells with at least this many genes. Blank will be treated as 0.',
'type': 'number',
'default': 200
},
'max_n_genes': {
'name': 'filter cells (max # of genes)',
'description': 'Include cells with at most this many genes. Blank will be treated as no maximum value.',
'type': 'number',
'default': 1700
},
'min_n_counts': {
'name': 'filter cells (min. total counts)',
'description': 'Include cells with at least this many counts. Blank will be treated as 0.',
'type': 'number',
'default': 0
},
'max_n_counts': {
'name': 'filter cells (max total counts)',
'description': 'Include cells with at most this many counts. Blank will be treated as no maximum value.',
'type': 'number',
'default': 5700
},
'min_percent_mito': {
'name': 'filter cells (min. % mito. genes)',
'description': 'Include cells with at least this % of reads mapped to mitochondrial genes. Blank will be treated as 0.',
'type': 'number',
'default': 0
},
'max_percent_mito': {
'name': 'filter cells (max % mito. genes)',
'description': 'Include cells with at most this % of reads mapped to mitochondrial genes. Blank will be treated as no maximum value.',
'type': 'number',
'default': 6
},
'normalization_method': {
'name': 'log normalize',
'description': 'Perform log normalization on the data.',
'choices': {'Yes': 'LogNormalize', 'No': ''}
},
'do_regression': {
'name': 'perform regression',
'description': 'Regress out effect of cell counts and percent mitochondrial genes',
'choices': {'Yes': True, 'No': False}
},
'output_var': {
'hide': True
}
})
Cluster cells using the Louvain method for community detection on `N` principal components. Then use t-SNE to visualize cells, using `N` principal components.
It is important to note that the fewer PCs we choose to use, the less noise we have when clustering, but at the risk of excluding relevant biological variance. Look at the plot in Step 2 showing the percent variance explained by each principle component and choose a cutoff where there is a clear elbow in the graph. The N
principal components to the left of this cutoff are then used to cluster the cells.
genepattern.GPUIBuilder(
analysis.cluster_cells,
function_import='analysis.cluster_cells',
name='Open Cell Clustering Interface',
description='This step outputs an interactive interface to cluster cells.',
parameters={
'output_var': {
'hide': True
}
})
Visualization the expression of markers on the clustering plot.
Explore genes that are differentially expressed in clusters as potential biomarkers. We provide the Wilcoxon rank-sum test (default) and t-test as test options. The non-parametric Wilcoxon rank-sum test generally performs better for single-cell data since single-cell counts are usually not normally distributed, which the t-test assumes. The Wilcoxon rank-sum test is also less sensitive to outliers.
The following are canonical markers that mark known cell types in this dataset. These can be used to identify what cell types are present and how they correspond to clusters.
| Cell Type | Markers |
| --------- | ------- |
| CD4 T Cells | IL7R |
| LYZCD14+ Monocytes | CD14, LYZ |
| B cells | MS4A1 |
| CD8 T cells | CD8A |
| FCGR3A+ Monocytes | FCGR3A, MS4A7 |
| NK cells | GNLY, NKG7 |
| Dendritic Cells | FCER1A, CST3 |
| Megakaryocytes | PPBP |
genepattern.GPUIBuilder(
analysis.visualize_top_markers,
function_import='analysis.visualize_top_markers',
name='Show Cluster Marker Heatmap',
description='This step outputs an interactive interface to explore gene expression in clusters of cells.',
parameters={
'output_var': {
'hide': True
}
}
)
genepattern.GPUIBuilder(
analysis.visualize_marker_expression,
function_import='analysis.visualize_marker_expression',
name='Visualize Marker Expression',
description='This step outputs an interactive interface to explore gene expression in clusters of cells.',
parameters={
'output_var': {
'hide': True
}
}
)
genepattern.GPUIBuilder(
analysis.compare_clusters,
function_import='analysis.compare_clusters',
name='Compare Gene Expression Between Clusters',
description='This step outputs an interactive volcano plot comparing genes between different clusters of cells.',
parameters={
'output_var': {
'hide': True
}
}
)
X.csv
- The preprocessed expression matrix of cells (rows) by genes (columns). This only includes variable genes, usually a much smaller subset of genes compared to the raw counts.obs.csv
- Cell annotations including # of genes, % mitochondrial genes, and cluster assignmentsobsm.csv
- Coordinates of cells in various dimensional reduction spaces (e.g., PCA, t-SNE).var.csv
- Gene annotations (of variable genes) including the # of cells, mean expression, dispersion, and normalized dispersion statistics.varm.csv
- Loadings of cells in various dimensional reduction spaces (e.g., PCA, t-SNE).
uns/pca_variance_ratio.csv
- % variance explained by each principal component.uns/rank_genes_groups_gene_names.csv
- Names of the top ranked marker genes for each cluster.uns/rank_genes_groups_gene_scores.csv
- z-scores of the top ranked marker genes for each cluster.Using the cell below, export results as a series of .csv
files or compressed .h5ad
file. The files will be created and saved to the chosen local data folder.
After creating, you may want to upload to a ScienceData folder.
dataSavePath = 'data/analysis'
dataUploadFolderUrl = 'https://sciencedata/files/tmp/'
genepattern.GPUIBuilder(
analysis.export_data,
function_import='analysis.export_data',
name='Export Analysis Data',
description='Export data as a series of .csv files or compressed .hda5 file.',
parameters={
'path': {
'name': 'filepath',
'description': 'Name of directory where file(s) will be saved. Exporting as an h5ad file produces a single file output.',
'default': dataSavePath
},
'h5ad': {
'name': 'file format',
'description': 'Choose to save either as .csv files or as a single compressed .h5ad-formatted hdf5 file. It is recommended to export .csv files unless you know what you are doing.',
'default': False,
'choices': {
'Comma-separated values (.csv)': False,
'H5AD file (HDF5 file in the AnnData formatting convention)': True
},
},
'output_var': {
'hide': True
}
}
)
# Evaluate to create a zip archive and upload it to ScienceData
dirName = os.path.basename(dataSavePath.rstrip('/'))
parentName = os.path.dirname(dataSavePath.rstrip('/'))
os.system("cd "+parentName+"; if [ -d '"+dirName+"' ]; then zip -r "+dirName+".zip "+dirName+"; fi; curl --insecure --upload "+dirName+".zip "+dataUploadFolderUrl.rstrip('/')+'/'+dirName+".zip")
This analysis employs a scRNA-seq gene expression dataset for 2700 peripheral blood mononuclear cells (PBMCs) from a healthy donor as a demonstration of its use. We are able to recapitulate cell types; the clusters can be characterized by visualizing the expression of canonical markers of these cell types on the 2D t-SNE projection plot as done in Step 3 and Step 4.
In Step 4, the visualization of cluster marker expression helps us understand why some cells we identify visually as a single cluster are divided into multiple. For example, LYZ is overexpressed in a cloud of samples that clustering separates as two distinct clusters. The LYZ gene encodes for human lysozyme, an antimicrobial agent associated with blood monocytes. Using the cluster comparison tool, we can see that one of the subclusters exhibits high relative expression of CD14 while the other exhibits high relative expression of FCGR3A, also known as the CD16 receptor gene. These two genes characterize two known subtypes of blood monocytes respectively; classical and non-classical monocytes.